Results 1 - 18 of 18
1.
J Biomed Inform ; 150: 104600, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38301750

ABSTRACT

BACKGROUND: Lack of trust in artificial intelligence (AI) models in medicine is still the key barrier to the use of AI in clinical decision support systems (CDSS). Although AI models already perform excellently in systems medicine, their black-box nature means that patient-specific decisions are incomprehensible to the physician. Explainable AI (XAI) algorithms aim to "explain" to a human domain expert which input features influenced a specific recommendation. However, in the clinical domain, these explanations must lead to some degree of causal understanding by a clinician. RESULTS: We developed the CLARUS platform, aiming to promote human understanding of graph neural network (GNN) predictions. CLARUS enables the visualisation of patient-specific networks, as well as relevance values for genes and interactions, computed by XAI methods such as GNNExplainer. This enables domain experts to gain deeper insights into the network and, more importantly, the expert can interactively alter the patient-specific network based on the acquired understanding and initiate re-prediction or retraining. This interactivity allows us to ask manual counterfactual questions and analyse the effects on the GNN prediction. CONCLUSION: We present the first interactive XAI platform prototype, CLARUS, that allows not only the evaluation of specific human counterfactual questions based on user-defined alterations of patient networks and a re-prediction of the clinical outcome but also a retraining of the entire GNN after changing the underlying graph structures. The platform is currently hosted by the GWDG on https://rshiny.gwdg.de/apps/clarus/.


Subjects
Decision Support Systems, Clinical , Physicians , Humans , Artificial Intelligence , Neural Networks, Computer , Algorithms , Tolnaftate
2.
Bioinformatics ; 39(11), 2023 11 01.
Article in English | MEDLINE | ID: mdl-37988152

ABSTRACT

SUMMARY: Federated learning enables collaboration in medicine, where data is scattered across multiple centers, without the need to aggregate the data in a central cloud. While, in general, machine learning models can be applied to a wide range of data types, graph neural networks (GNNs) are developed specifically for graphs, which are very common in the biomedical domain. For instance, a patient can be represented by a protein-protein interaction (PPI) network where the nodes contain the patient-specific omics features. Here, we present our Ensemble-GNN software package, which can be used to deploy federated, ensemble-based GNNs in Python. Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation. As an example, we show the results from a public dataset of 981 patients and 8469 genes from the Cancer Genome Atlas (TCGA). AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/pievos101/Ensemble-GNN, and the data at Zenodo (DOI: 10.5281/zenodo.8305122).


Subjects
DNA Methylation , Machine Learning , Humans , Neural Networks, Computer , Protein Interaction Maps , Software
3.
J Biomed Inform ; 147: 104497, 2023 11.
Article in English | MEDLINE | ID: mdl-37777164

ABSTRACT

A log-likelihood based co-occurrence analysis of ∼1.9 million de-identified ICD-10 codes and related short textual problem list entries generated possible term candidates at a significance level of p<0.01. These top 10 term candidates, consisting of 1 to 5-grams, were used as seed terms for an embedding based nearest neighbor approach to fetch additional synonyms, hypernyms and hyponyms in the respective n-gram embedding spaces by leveraging two different language models. This was done to analyze the lexicality of the resulting term candidates and to compare the term classifications of both models. We found no difference in system performance during the processing of lexical and non-lexical content, i.e. abbreviations, acronyms, etc. Additionally, an application-oriented analysis of the SapBERT (Self-Alignment Pretraining for Biomedical Entity Representations) language model indicates suitable performance for the extraction of all term classifications such as synonyms, hypernyms, and hyponyms.


Subjects
Language , Natural Language Processing , Likelihood Functions , Cluster Analysis
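
The log-likelihood based co-occurrence analysis mentioned in the abstract above can be illustrated with Dunning's G² statistic on a 2x2 contingency table. This is a minimal sketch under stated assumptions: the abstract does not specify the exact formulation used, and the function name, argument names, and counts are hypothetical. The only fixed value is the chi-square critical value for one degree of freedom at p = 0.01.

```python
import math

# Chi-square critical value for 1 degree of freedom at p = 0.01.
CRITICAL_P01 = 6.635


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 co-occurrence table (hypothetical helper).

    k11: entries containing both the term and the ICD-10 code
    k12: entries containing the term but not the code
    k21: entries containing the code but not the term
    k22: entries containing neither
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# A term/code pair is kept as a candidate when G^2 exceeds the critical value.
is_candidate = log_likelihood_ratio(50, 5, 5, 940) > CRITICAL_P01
```

A table in which term and code occur independently yields a statistic of zero, so only pairs that co-occur more (or less) often than chance would pass the threshold.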
4.
J Biomed Inform ; 143: 104406, 2023 07.
Article in English | MEDLINE | ID: mdl-37257630

ABSTRACT

Multi-view clustering methods are essential for the stratification of patients into sub-groups of similar molecular characteristics. In recent years, a wide range of methods have been developed for this purpose. However, due to the high diversity of cancer-related data, a single method may not perform sufficiently well in all cases. We present Parea, a multi-view hierarchical ensemble clustering approach for disease subtype discovery. We demonstrate its performance on several machine learning benchmark datasets. We apply and validate our methodology on real-world multi-view patient data, comprising seven types of cancer. Parea outperforms the current state-of-the-art on six out of seven analysed cancer types. We have integrated the Parea method into our Python package Pyrea (https://github.com/mdbloice/Pyrea), which enables the effortless and flexible design of ensemble workflows while incorporating a wide range of fusion and clustering algorithms.


Subjects
Algorithms , Neoplasms , Humans , Cluster Analysis , Neoplasms/genetics , Machine Learning
5.
Stud Health Technol Inform ; 302: 788-792, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37203496

ABSTRACT

Clinical information systems have become large repositories for semi-structured and partly annotated electronic health record data, which have reached a critical mass that makes them interesting for supervised data-driven neural network approaches. We explored automated coding of 50-character-long clinical problem list entries using the International Classification of Diseases (ICD-10) and evaluated three different types of network architectures on the top 100 ICD-10 three-digit codes. A fastText baseline reached a macro-averaged F1-score of 0.83, followed by a character-level LSTM with a macro-averaged F1-score of 0.84. The top performing approach used a downstreamed RoBERTa model with a custom language model, yielding a macro-averaged F1-score of 0.88. A neural network activation analysis, together with an investigation of the false positives and false negatives, revealed inconsistent manual coding as a main limiting factor.


Subjects
Language , Neural Networks, Computer , International Classification of Diseases , Electronic Health Records
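
The macro-averaged F1-score reported in the abstract above weights every class equally, regardless of how often it occurs. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name and the example labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of the per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical three-digit ICD-10 codes used only for illustration.
gold = ["I10", "I10", "E11", "J45"]
predicted = ["I10", "E11", "E11", "J45"]
score = macro_f1(gold, predicted)
```

Because each class contributes equally, a rare code that is always miscoded lowers the macro average as much as a frequent one, which is why the metric is sensitive to inconsistent manual coding.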
6.
Stud Health Technol Inform ; 302: 827-828, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37203508

ABSTRACT

A semi-structured clinical problem list containing ∼1.9 million de-identified entries linked to ICD-10 codes was used to identify closely related real-world expressions. A log-likelihood based co-occurrence analysis generated seed-terms, which were integrated as part of a k-NN search, by leveraging SapBERT for the generation of an embedding representation.


Subjects
Electronic Health Records , Natural Language Processing , Likelihood Functions
7.
Sci Rep ; 12(1): 16857, 2022 10 07.
Article in English | MEDLINE | ID: mdl-36207536

ABSTRACT

Machine learning methods can detect complex relationships between variables, but usually do not exploit domain knowledge. This is a limitation because in many scientific disciplines, such as systems biology, domain knowledge is available in the form of graphs or networks, and its use can improve model performance. We need network-based algorithms that are versatile and applicable in many research areas. In this work, we demonstrate subnetwork detection based on multi-modal node features using a novel Greedy Decision Forest (GDF) with inherent interpretability. The latter will be a crucial factor to retain experts and gain their trust in such algorithms. To demonstrate a concrete application example, we focus on bioinformatics, systems biology and particularly biomedicine, but the presented methodology is applicable in many other domains as well. Systems biology is a good example of a field in which statistical data-driven machine learning enables the analysis of large amounts of multi-modal biomedical data. This is important to reach the future goal of precision medicine, where the complexity of patients is modeled on a system level to best tailor medical decisions, health practices and therapies to the individual patient. Our proposed explainable approach can help to uncover disease-causing network modules from multi-omics data to better understand complex diseases such as cancer.


Subjects
Algorithms , Machine Learning , Computational Biology/methods , Humans , Precision Medicine , Systems Biology
8.
Bioinformatics ; 38(Suppl_2): ii120-ii126, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36124793

ABSTRACT

MOTIVATION: The tremendous success of graph neural networks (GNNs) has already had a major impact on systems biology research. For example, GNNs are currently being used for drug target recognition in protein-drug interaction networks, as well as for cancer gene discovery and more. Important aspects whose practical relevance is often underestimated are comprehensibility, interpretability and explainability. RESULTS: In this work, we present a novel graph-based deep learning framework for disease subnetwork detection via explainable GNNs. Each patient is represented by the topology of a protein-protein interaction (PPI) network, and the nodes are enriched with multi-omics features from gene expression and DNA methylation. In addition, we propose a modification of the GNNExplainer that provides model-wide explanations for improved disease subnetwork detection. AVAILABILITY AND IMPLEMENTATION: The proposed methods and tools are implemented in the GNN-SubNet Python package, which we have made available on our GitHub for the international research community (https://github.com/pievos101/GNN-SubNet). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer , Protein Interaction Maps , Humans
9.
Stud Health Technol Inform ; 294: 137-138, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612038

ABSTRACT

Feature selection is a fundamental challenge in machine learning. In bioinformatics, for instance, it is essential when one wishes to detect biomarkers. Tree-based methods are predominantly used for this purpose. In this paper, we study the stability of the feature selection methods BORUTA, VITA, and RRF (regularized random forest). In particular, we investigate the feature ranking instability of the associated stochastic algorithms. To stabilize the feature ranks, we propose to compute consensus values from multiple feature selection runs, applying rank aggregation techniques. Our results show that these consolidated features are more accurate and robust, which helps to make practical machine learning applications more trustworthy.


Subjects
Algorithms , Machine Learning , Biomarkers , Computational Biology/methods
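
The idea of computing consensus values from multiple feature selection runs, described in the abstract above, can be sketched with a simple mean-rank (Borda-style) aggregation. The abstract does not name the rank aggregation technique used, so mean rank is an assumption here, as are the function names and the example importance scores. The sketch also assumes every run scores the same set of features.

```python
from statistics import mean


def ranks_from_scores(scores):
    """Map feature -> rank (1 = most important) for a single run.

    Ties are broken alphabetically by feature name.
    """
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return {feature: pos for pos, feature in enumerate(ordered, start=1)}


def consensus_ranking(runs):
    """Aggregate several runs by mean rank; returns features, best first."""
    per_run_ranks = [ranks_from_scores(run) for run in runs]
    features = list(per_run_ranks[0])
    mean_rank = {f: mean(r[f] for r in per_run_ranks) for f in features}
    return sorted(features, key=lambda f: (mean_rank[f], f))


# Hypothetical importance scores from three runs of a stochastic selector.
runs = [
    {"g1": 0.90, "g2": 0.50, "g3": 0.10},
    {"g1": 0.40, "g2": 0.60, "g3": 0.20},
    {"g1": 0.80, "g2": 0.30, "g3": 0.35},
]
consensus = consensus_ranking(runs)
```

In this example the three runs disagree on the order of the features, while the aggregated ranking is determined by the mean ranks 4/3, 2, and 8/3.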
10.
Kunstliche Intell (Oldenbourg) ; 36(3-4): 271-285, 2022.
Article in English | MEDLINE | ID: mdl-36590103

ABSTRACT

Graph Neural Networks (GNN) show good performance in relational data classification. However, their contribution to concept learning and the validation of their output from an application domain's and user's perspective have not been thoroughly studied. We argue that combining symbolic learning methods, such as Inductive Logic Programming (ILP), with statistical machine learning methods, especially GNNs, is an essential forward-looking step to perform powerful and validatable relational concept learning. In this contribution, we introduce a benchmark for the conceptual validation of GNN classification outputs. It consists of the symbolic representations of symmetric and non-symmetric figures that are taken from a well-known Kandinsky Pattern data set. We further provide a novel validation framework that can be used to generate comprehensible explanations with ILP on top of the relevance output of GNN explainers and human-expected relevance for concepts learned by GNNs. Our experiments conducted on our benchmark data set demonstrate that it is possible to extract symbolic concepts from the most relevant explanations that are representative of what a GNN has learned. Our findings open up a variety of avenues for future research on validatable explanations for GNNs.

11.
J Biomed Inform ; 113: 103636, 2021 01.
Article in English | MEDLINE | ID: mdl-33271342

ABSTRACT

Recent advances in multi-omics clustering methods enable a more fine-tuned separation of cancer patients into clinically relevant clusters. These advancements have the potential to provide a deeper understanding of cancer progression and may facilitate the treatment of cancer patients. Here, we present a simple hierarchical clustering and data fusion approach, named HC-fused, for the detection of disease subtypes. Unlike other methods, the proposed approach naturally reports on the individual contribution of each single-omic to the data fusion process. We perform multi-view simulations with disjoint and disjunct cluster elements across the views to highlight fundamentally different data integration behavior of various state-of-the-art methods. HC-fused combines the strengths of some recently published methods and shows superior performance on real-world cancer data from the TCGA (The Cancer Genome Atlas) database. An R implementation of our method is available on GitHub (pievos101/HC-fused).


Subjects
Algorithms , Neoplasms , Cluster Analysis , Databases, Factual , Humans , Neoplasms/genetics
12.
Mol Ecol Resour ; 20(6): 1597-1609, 2020 Nov.
Article in English | MEDLINE | ID: mdl-32639602

ABSTRACT

In recent years, genome-scan methods have been extensively used to detect local signatures of selection and introgression. Most of these methods are designed for either one or the other case, which may impair the study of combined cases. Here, we introduce a series of versatile genome-scan methods applicable to both cases, the detection of selection and introgression. The proposed approaches are based on nonparametric k-nearest neighbour (kNN) techniques, while incorporating pairwise Fixation Index (FST) and pairwise nucleotide differences (dxy) as features. We benchmark our methods using a wide range of simulation scenarios, with varying parameters, such as recombination rates, population background histories, selection strengths, the proportion of introgression and the time of gene flow. We find that kNN-based methods perform remarkably well compared with the state-of-the-art. Finally, we demonstrate how to perform kNN-based genome scans on real-world genomic data using the population genomics R-package PopGenome.


Subjects
Computer Simulation , Genome , Genomics , Models, Genetic , Gene Flow , Genetics, Population , Metagenomics , Polymorphism, Single Nucleotide , Selection, Genetic
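
One of the features named in the abstract above, the pairwise nucleotide difference dxy, is commonly defined as the average number of differences per site between sequences drawn from two populations. The sketch below illustrates that common definition only; it is not the PopGenome implementation, it assumes aligned sequences of equal length without missing data, and the function name and sequences are hypothetical.

```python
def dxy(pop_x, pop_y):
    """Average per-site differences between sequences of two populations."""
    length = len(pop_x[0])
    total_diffs = sum(
        sum(a != b for a, b in zip(seq_x, seq_y))
        for seq_x in pop_x
        for seq_y in pop_y
    )
    # Normalise by the number of between-population pairs and by the
    # number of aligned sites in the window.
    return total_diffs / (len(pop_x) * len(pop_y) * length)


# Hypothetical 4-bp window with two sequences per population.
window_dxy = dxy(["AAAA", "AAAT"], ["AATT", "AATT"])
```

In a genome scan, a value like this would be computed per window and combined with FST to form the feature vector that the kNN step operates on.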
13.
BMC Med Inform Decis Mak ; 19(Suppl 3): 72, 2019 04 04.
Article in English | MEDLINE | ID: mdl-30943968

ABSTRACT

BACKGROUND: The amount of patient-related information within clinical information systems accumulates over time, especially in cases where patients suffer from chronic diseases with many hospitalizations and consultations. The diagnosis or problem list is an important feature of the electronic health record, which provides a dynamic account of a patient's current illness and past history. In the case of an Austrian hospital network, problem list entries are limited to fifty characters and are potentially linked to ICD-10. The requirement of producing ICD codes at each hospital stay, together with the length limitation of list items, leads to highly redundant problem lists, which conflicts with the physicians' need to get a good overview of a patient in a short time. This paper investigates a method by which problem list items can be semantically grouped, in order to allow for fast navigation through patient-related topic spaces. METHODS: We applied a minimal language-dependent preprocessing strategy and mapped problem list entries as tf-idf weighted character 3-grams into a numerical vector space. Based on this representation, we used the unweighted pair group method with arithmetic mean (UPGMA) clustering algorithm with cosine distances and inferred an optimal boundary in order to form semantically consistent topic spaces, taking into consideration different levels of dimensionality reduction via latent semantic analysis (LSA). RESULTS: With the proposed clustering approach, evaluated via an intra- and inter-patient scenario in combination with a natural language pipeline, we achieved an average compression rate of 80% of the initial list items forming consistent semantic topic spaces with an F-measure greater than 0.80 in both cases. The average number of identified topics in the intra-patient case (µIntra = 78.4) was slightly lower than in the inter-patient case (µInter = 83.4). LSA-based feature space reduction had no significant positive performance impact in our investigations. CONCLUSIONS: The investigation presented here is centered on a data-driven solution to the known problem of information overload, which causes ineffective human-computer interactions at clinicians' workplaces. This problem is addressed by navigable disease topic spaces where related items are grouped and the topics can be more easily accessed.


Subjects
Cluster Analysis , Data Management/methods , Electronic Health Records , Austria , Humans , International Classification of Diseases , Semantics , User-Computer Interface
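
The pipeline in the METHODS part of the abstract above (tf-idf weighted character 3-grams, cosine distances, UPGMA clustering with a cut-off) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the tf-idf variant (`log(N/df) + 1`), the fixed `threshold` standing in for the inferred optimal boundary, and all function names are hypothetical, and the LSA step is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(entries, n=3):
    """Sparse tf-idf vectors over character n-grams (one common variant)."""
    counts = [Counter(char_ngrams(e, n)) for e in entries]
    doc_freq = Counter(g for c in counts for g in c)
    num_docs = len(entries)
    return [
        {g: tf * (math.log(num_docs / doc_freq[g]) + 1.0) for g, tf in c.items()}
        for c in counts
    ]


def cosine_distance(u, v):
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # entries shorter than n characters have no n-grams
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    return 1.0 - dot / (norm_u * norm_v)


def upgma_clusters(entries, threshold):
    """Average-linkage (UPGMA) clustering, stopped at a distance threshold."""
    vectors = tfidf_vectors(entries)
    n = len(entries)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # UPGMA: mean of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [[entries[i] for i in c] for c in clusters]


topics = upgma_clusters(["diabetes", "diabetes", "hypertonie"], threshold=0.5)
```

Character 3-grams make the representation tolerant of abbreviations and spelling variants in short free-text entries, which is plausibly why so little language-dependent preprocessing is needed.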
14.
BMC Bioinformatics ; 20(1): 207, 2019 Apr 23.
Article in English | MEDLINE | ID: mdl-31014244

ABSTRACT

BACKGROUND: Research over the last 10 years highlights the increasing importance of hybridization between species as a major force structuring the evolution of genomes and potentially providing raw material for adaptation by natural and/or sexual selection. Fueled by research in a few model systems where phenotypic hybrids are easily identified, research into hybridization and introgression (the flow of genes between species) has exploded with the advent of whole-genome sequencing and emerging methods to detect the signature of hybridization at the whole-genome or chromosome level. Amongst these are a general class of methods that utilize patterns of single-nucleotide polymorphisms (SNPs) across a tree as markers of hybridization. These methods have been applied to a variety of genomic systems ranging from butterflies to Neanderthals to detect introgression; however, when employed at a fine genomic scale, they do not perform well at quantifying introgression in small sample windows. RESULTS: We introduce a novel method to detect introgression by combining two widely used statistics: pairwise nucleotide diversity dxy and Patterson's D. The resulting statistic, the distance fraction (df), accounts for genetic distance across possible topologies and is designed to simultaneously detect and quantify introgression. We also relate our new method to the recently published fd and incorporate these statistics into the powerful genomics R-package PopGenome, freely available on GitHub (pievos101/PopGenome) and the Comprehensive R Archive Network (CRAN). The supplemental material contains a wide range of simulation studies and a detailed manual on how to compute the statistics within the PopGenome framework. CONCLUSION: We present a new distance-based statistic df that avoids the pitfalls of Patterson's D when applied to small genomic regions and accurately quantifies the fraction of introgression (f) for a wide range of simulation scenarios.


Subjects
Genomics/methods , Hybridization, Genetic/genetics , Models, Genetic , Whole Genome Sequencing/methods , Databases, Genetic , Gene Flow , Models, Statistical , Polymorphism, Single Nucleotide
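
Patterson's D, one of the two statistics combined in the abstract above, contrasts ABBA and BABA site patterns on a four-taxon tree (((P1, P2), P3), O). The sketch below shows the widely used frequency-based form of D only. It does not implement the paper's distance fraction df, whose definition is not given in the abstract, and the function name and input layout are hypothetical.

```python
def pattersons_d(sites):
    """Patterson's D from derived-allele frequencies.

    sites: iterable of (p1, p2, p3, p4) tuples holding the derived-allele
    frequency in populations P1, P2, P3 and the outgroup at each site.
    """
    abba = baba = 0.0
    for p1, p2, p3, p4 in sites:
        abba += (1 - p1) * p2 * p3 * (1 - p4)  # derived allele shared by P2 and P3
        baba += p1 * (1 - p2) * p3 * (1 - p4)  # derived allele shared by P1 and P3
    if abba + baba == 0:
        return float("nan")  # no informative sites in this window
    return (abba - baba) / (abba + baba)


# Three ABBA sites and one BABA site, with a fixed ancestral outgroup.
window_d = pattersons_d([(0, 1, 1, 0)] * 3 + [(1, 0, 1, 0)])
```

In small windows the denominator rests on only a handful of informative sites, so D becomes highly variable. This is the weakness at fine genomic scales that the abstract refers to.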
15.
Stud Health Technol Inform ; 248: 100-107, 2018.
Article in English | MEDLINE | ID: mdl-29726425

ABSTRACT

Patients with multiple disorders usually have long diagnosis lists, consisting of ICD-10 codes together with individual free-text descriptions. These text snippets are produced when physicians overwrite standardized ICD code topics at the point of care. They provide highly compact expert descriptions within a 50-character-long text field that is frequently not assigned to a specific ICD-10 code. The high redundancy of these lists would benefit from content-based categorization within different hospital-based application scenarios. This work demonstrates how to accurately group diagnosis lists via a combination of natural language processing and hierarchical clustering with an overall F-measure value of 0.87. In addition, it compresses the initial diagnosis list by up to 89%. The manuscript discusses pitfalls and challenges as well as the potential of a large-scale approach for tackling this problem.


Subjects
Electronic Health Records , International Classification of Diseases , Natural Language Processing , Humans
16.
Bioinformatics ; 34(18): 3205-3207, 2018 09 15.
Article in English | MEDLINE | ID: mdl-29718170

ABSTRACT

Summary: The fixation index FST can be used to identify non-neutrally evolving loci from genome-scale SNP data across two or more populations. Recent years have seen the development of sophisticated approaches to estimate FST based on Markov-Chain Monte-Carlo simulations. Here, we present a vectorized R implementation of an extension of the widely used BayeScan software for codominant markers, adding the option to group individual SNPs into pre-defined blocks. A typical application of this new approach is the identification of genomic regions, genes, or gene sets containing SNPs that evolved under directional selection. Availability and implementation: The R implementation of our method, which builds on the powerful population genetics and genomics software PopGenome, is available freely from CRAN. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Bayes Theorem , Genetics, Population , Genomics , Genome , Polymorphism, Single Nucleotide , Software
17.
Bioinformatics ; 31(3): 413-5, 2015 Feb 01.
Article in English | MEDLINE | ID: mdl-25273104

ABSTRACT

SUMMARY: The statistical programming language R has become a de facto standard for the analysis of many types of biological data, and is well suited for the rapid development of new algorithms. However, variant call data from population-scale resequencing projects are typically too large to be read and processed efficiently with R's built-in I/O capabilities. WhopGenome can efficiently read whole-genome variation data stored in the widely used variant call format (VCF) file format into several R data types. VCF files can be accessed either on local hard drives or on remote servers. WhopGenome can associate variants with annotations such as those available from the UCSC genome browser, and can accelerate the reading process by filtering loci according to user-defined criteria. WhopGenome can also read other Tabix-indexed files and create indices to allow fast selective access to FASTA-formatted sequence files. AVAILABILITY AND IMPLEMENTATION: The WhopGenome R package is available on CRAN at http://cran.r-project.org/web/packages/WhopGenome/. A Bioconductor package has been submitted. CONTACT: lercher@cs.uni-duesseldorf.de.


Subjects
Algorithms , Genetic Variation , Genome, Human , Genomics/methods , Molecular Sequence Annotation , Software , Humans
18.
Mol Biol Evol ; 31(7): 1929-36, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24739305

ABSTRACT

Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson's MS and Ewing's MSMS programs to assess statistical significance based on coalescent simulations. PopGenome's integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN (http://cran.r-project.org/) for all major operating systems under the GNU General Public License.


Subjects
Metagenomics/methods , Software , Arabidopsis/genetics , Genetic Variation , Genome, Human , Genome, Plant , Humans , Polymorphism, Single Nucleotide , Web Browser